`recode` reference manual

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

1 Conversion of files between different charsets and usages

This ‘recode’ program converts files between various character sets and usages. When exact transliterations are not possible, as is often the case, the program may drop the offending characters or fall back on approximations.

Let us coin the term charset to represent, without distinction, a character set “per se” or a particular usage of a character set. This program recognizes or produces a little more than a dozen such charsets. Since it can convert each charset to almost any other one, more than one hundred different conversions are possible.

This tool pays special attention to superimposition of diacritics, particularly for French representation. This orientation is mostly historical; it does not impair the usefulness, generality or extensibility of the program. In fact, this program evolved for several years, through several programming languages and computer brands, because I used many different codings for French characters on different machines, each system having its own peculiarities.

You may find in this document:

1.1 How to use this program
1.2 Character sets recognized or produced
1.2.1 ASCII 8-bits for Apple’s Macintosh
1.2.2 ASCII 7-bits, <BS> to overstrike
1.2.2.1 Commented ASCII
1.2.2.2 Octal ASCII
1.2.2.3 Decimal ASCII
1.2.2.4 Hexadecimal ASCII
1.2.3 ASCII “bang bang”, escapes are ! and !!
1.2.3.1 Control Data’s Display Code
1.2.4 ASCII 8-bits as seen by Perkin Elmer
1.2.5 ASCII 8-bits as seen by Control Data
1.2.6 ASCII 6/12 from NOS, escapes are ^ and @
1.2.7 EBCDIC with no further comments
1.2.8 ASCII without diacritics nor underline
1.2.9 ASCII 8-bits for IBM’s PC
1.2.10 ASCII for the Unisys’ ICON
1.2.11 ASCII with LaTeX codes
1.2.12 ASCII extended by Latin Alphabet 1
1.2.12.1 Commented Latin-1
1.2.12.2 Octal Latin-1
1.2.12.3 Decimal Latin-1
1.2.12.4 Hexadecimal Latin-1
1.2.13 ASCII with easy French conventions
1.3 Easy French conventions
1.3.1 French quotes		How to type them.
1.3.2 Latin ligatures		They are not representable.
1.3.3 Diacritics		How to type them, things to know.
1.3.4 List of words ending with diaeresis
1.3.5 When, How and Who.
1.4 Internal aspects
1.4.1 Overall organization		Overall organization of the program.
1.4.2 Internal vs external piping		Distinction between internal or external piping.
1.4.3 Some limitations		A few limitations of the chosen implementation.
1.4.4 Adding new charsets		How to proceed in adding new charsets.
1.5 Future things

1.1 How to use this program

The general format of the program call is:

recode [OPTION]… [before]:[after] [file]…

Each file is read assuming it is coded with charset before, then recoded over itself so as to use charset after. If no file is given, the program acts as a filter and recodes standard input to standard output.

The available options are:

-C

Given this option, all other parameters and options are ignored. The program briefly prints the copyright and copying conditions. See the file ‘COPYING’ in the distribution for the full statement of these conditions.

-c

With Easy French conventions, use the colon : instead of the double quote " for marking diaeresis. See section Easy French conventions.

-f

This option is recognized, but otherwise ignored. In a future version, this option will be required for a file to be replaced by its recoded contents when the recoding is found not to be fully reversible. In this version, the replacement is done unconditionally.

-i

When the recoding requires a combination of two or more elementary recoding steps, this option forces many passes over the data, using intermediate files between passes. This is the default behaviour when files are recoded over themselves. If this option is selected in filter mode, that is, when the program reads standard input and writes standard output, it might take longer for programs further down the pipe chain to start receiving some recoded data.

-o

When the recoding requires a combination of two or more elementary recoding steps, this option forces the creation of a chain of program instances initiated through the popen(3) library call, all operating in parallel. In filter mode, at cost of some overhead, recoded data will be available soon after the program starts, even if many elementary recoding steps are required.

If, at installation time, the popen(3) call is said to be unavailable, selecting option -o is equivalent to selecting option -i.

-p

When the recoding requires a combination of two or more elementary recoding steps, this option forces the program to fork itself into a few copies interconnected with pipes, using the pipe(2) system call. All copies of the program operate in parallel. This method is similar to the method used through option -o, but is slightly more efficient. This is the default behaviour in filter mode. If this option is used when files are recoded over themselves, this should save some disk accesses and some disk space, at cost of more system overhead.

If, at installation time, the pipe(2) call is said to be unavailable, selecting option -p is equivalent to selecting option -o. If both pipe(2) and popen(3) are unavailable, selecting option -p is equivalent to selecting option -i.

-t

The touch option is meaningful only when files are recoded over themselves. Without it, the timestamps associated with files are preserved, to reflect the fact that changing the code of a file does not really alter its informational contents. When the user wants the recoded files to be timestamped at recoding time, this option inhibits the automatic protection of the timestamps.

-v

Before proceeding, the program will print on ‘stderr’ the list and order of application of the elementary conversions which are planned to achieve the global conversion. Then, the program will print on ‘stderr’ one message per file recoded, so as to keep the user informed of the progress of the command.

One or both of the before or after keywords may be omitted, but the colon which separates them cannot. An omitted keyword implies the usual or default code in use on the system where this program is installed. Usually, this default code is latin1 for UNIX systems or ibmpc for MS-DOS machines, but it might be changed to any other supported code when this program is installed.
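The rule for omitted keywords can be sketched as follows. This is only an illustration in Python, not the program's actual argument parser; the function name `parse_request` and the choice of latin1 as the installed default are assumptions made for the example.

```python
def parse_request(request, default="latin1"):
    """Split a 'before:after' request, filling in omitted charsets.

    `default` stands for the charset chosen at installation time
    (latin1 is assumed here, as on a typical UNIX installation).
    """
    if ":" not in request:
        # The colon itself may never be omitted.
        raise ValueError("the colon separating the charsets is mandatory")
    before, after = request.split(":", 1)
    return (before or default, after or default)
```

Under these assumptions, `parse_request(":ibmpc")` yields `("latin1", "ibmpc")` and `parse_request("texte:")` yields `("texte", "latin1")`.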

1.2 Character sets recognized or produced

The possible values for charset before or charset after are provided as the keys in the following menu.

1.2.1 ASCII 8-bits for Apple’s Macintosh

The file was obtained from, or is destined for, a Macintosh microcomputer from Apple. This is an eight-bit code. The file is the data fork only.

1.2.2 ASCII 7-bits, <BS> to overstrike

The file is straight ASCII, seven bits only. According to the definition of ASCII, diacritics are applied by a sequence of three characters: the letter, one <BS>, the diacritic mark. We deviate slightly from this by exchanging the diacritic mark and the letter so that, on a screen device, the diacritic will disappear and leave the letter alone. At recognition time, both methods are acceptable.

The French quotes are coded by the sequences: < <BS> " or " <BS> < for the opening quote and > <BS> " or " <BS> > for the closing quote. This artificial convention was inherited in straight ASCII from habits around bangbang entry, and is not well known. But we decided to stick to it so that the ascii charset will not lose French quotes.
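The overstrike convention for diacritics may be easier to follow with a small sketch. The Python below is an illustration only, not the program's recognizer: the set of ASCII marks in `COMBINING` is an assumption (this section does not list them), the function names are hypothetical, and French quotes are left out.

```python
import unicodedata

BS = "\b"

# Assumed set of ASCII diacritic marks, mapped to Unicode combining marks.
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # diaeresis
    ",": "\u0327",  # cedilla
    "~": "\u0303",  # tilde
}


def encode_overstrike(letter, mark):
    """Produce the mark-first order, so a screen ends up showing the bare letter."""
    return mark + BS + letter


def decode_overstrike(text):
    """Accept both orders: letter <BS> mark and mark <BS> letter."""
    out = []
    i = 0
    while i < len(text):
        if i + 2 < len(text) and text[i + 1] == BS:
            first, second = text[i], text[i + 2]
            if first.isalpha() and second in COMBINING:
                letter, mark = first, second
            elif second.isalpha() and first in COMBINING:
                letter, mark = second, first
            else:
                letter = None
            if letter is not None:
                # Compose letter and mark into one precomposed character.
                out.append(unicodedata.normalize("NFC", letter + COMBINING[mark]))
                i += 3
                continue
        out.append(text[i])
        i += 1
    return "".join(out)
```

With these assumptions, both `"e\b'"` and `"'\be"` decode to an e with an acute accent.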

1.2.2.1 Commented ASCII

oct dec hex     name    description

000   0  0      nul     null character
001   1  1      soh     start of header
002   2  2      stx     start of text
003   3  3      etx     end of text
004   4  4      eot     end of transmission
005   5  5      enq     enquiry
006   6  6      ack     acknowledge
007   7  7      bel     bell
010   8  8      bs      back space
011   9  9      ht      horizontal tab
012  10  a      nl      new line
013  11  b      vt      vertical tab
014  12  c      np      new page
015  13  d      cr      carriage return
016  14  e      so      shift out
017  15  f      si      shift in
020  16 10      dle     data link escape
021  17 11      dc1     device control 1
022  18 12      dc2     device control 2
023  19 13      dc3     device control 3
024  20 14      dc4     device control 4
025  21 15      nak     negative acknowledge
026  22 16      syn     synchronize
027  23 17      etb     end of transmitted block
030  24 18      can     cancel
031  25 19      em      end of medium
032  26 1a      sub     substitute
033  27 1b      esc     escape
034  28 1c      fs      file separator
035  29 1d      gs      group separator
036  30 1e      rs      record separator
037  31 1f      us      unit separator
040  32 20      sp      space

177 127 7f      del     delete

1.2.2.2 Octal ASCII

000 nul  020 dle  040 sp 060 0  100 @  120 P  140 `  160 p 
001 soh  021 dc1  041 !  061 1  101 A  121 Q  141 a  161 q 
002 stx  022 dc2  042 "  062 2  102 B  122 R  142 b  162 r 
003 etx  023 dc3  043 #  063 3  103 C  123 S  143 c  163 s 
004 eot  024 dc4  044 $  064 4  104 D  124 T  144 d  164 t 
005 enq  025 nak  045 %  065 5  105 E  125 U  145 e  165 u 
006 ack  026 syn  046 &  066 6  106 F  126 V  146 f  166 v 
007 bel  027 etb  047 '  067 7  107 G  127 W  147 g  167 w 
010 bs   030 can  050 (  070 8  110 H  130 X  150 h  170 x 
011 ht   031 em   051 )  071 9  111 I  131 Y  151 i  171 y 
012 nl   032 sub  052 *  072 :  112 J  132 Z  152 j  172 z 
013 vt   033 esc  053 +  073 ;  113 K  133 [  153 k  173 { 
014 np   034 fs   054 ,  074 <  114 L  134 \  154 l  174 | 
015 cr   035 gs   055 -  075 =  115 M  135 ]  155 m  175 } 
016 so   036 rs   056 .  076 >  116 N  136 ^  156 n  176 ~ 
017 si   037 us   057 /  077 ?  117 O  137 _  157 o  177 del

1.2.2.3 Decimal ASCII

  0 nul  16 dle  32 sp 48 0  64 @  80 P   96 `  112 p 
  1 soh  17 dc1  33 !  49 1  65 A  81 Q   97 a  113 q 
  2 stx  18 dc2  34 "  50 2  66 B  82 R   98 b  114 r 
  3 etx  19 dc3  35 #  51 3  67 C  83 S   99 c  115 s 
  4 eot  20 dc4  36 $  52 4  68 D  84 T  100 d  116 t 
  5 enq  21 nak  37 %  53 5  69 E  85 U  101 e  117 u 
  6 ack  22 syn  38 &  54 6  70 F  86 V  102 f  118 v 
  7 bel  23 etb  39 '  55 7  71 G  87 W  103 g  119 w 
  8 bs   24 can  40 (  56 8  72 H  88 X  104 h  120 x 
  9 ht   25 em   41 )  57 9  73 I  89 Y  105 i  121 y 
 10 nl   26 sub  42 *  58 :  74 J  90 Z  106 j  122 z 
 11 vt   27 esc  43 +  59 ;  75 K  91 [  107 k  123 {
 12 np   28 fs   44 ,  60 <  76 L  92 \  108 l  124 |
 13 cr   29 gs   45 -  61 =  77 M  93 ]  109 m  125 }
 14 so   30 rs   46 .  62 >  78 N  94 ^  110 n  126 ~
 15 si   31 us   47 /  63 ?  79 O  95 _  111 o  127 del

1.2.2.4 Hexadecimal ASCII

 00 nul  10 dle  20 sp 30 0  40 @  50 P  60 `  70 p 
 01 soh  11 dc1  21 !  31 1  41 A  51 Q  61 a  71 q 
 02 stx  12 dc2  22 "  32 2  42 B  52 R  62 b  72 r 
 03 etx  13 dc3  23 #  33 3  43 C  53 S  63 c  73 s 
 04 eot  14 dc4  24 $  34 4  44 D  54 T  64 d  74 t 
 05 enq  15 nak  25 %  35 5  45 E  55 U  65 e  75 u 
 06 ack  16 syn  26 &  36 6  46 F  56 V  66 f  76 v 
 07 bel  17 etb  27 '  37 7  47 G  57 W  67 g  77 w 
 08 bs   18 can  28 (  38 8  48 H  58 X  68 h  78 x 
 09 ht   19 em   29 )  39 9  49 I  59 Y  69 i  79 y 
 0a nl   1a sub  2a *  3a :  4a J  5a Z  6a j  7a z 
 0b vt   1b esc  2b +  3b ;  4b K  5b [  6b k  7b { 
 0c np   1c fs   2c ,  3c <  4c L  5c \  6c l  7c | 
 0d cr   1d gs   2d -  3d =  4d M  5d ]  6d m  7d } 
 0e so   1e rs   2e .  3e >  4e N  5e ^  6e n  7e ~ 
 0f si   1f us   2f /  3f ?  4f O  5f _  6f o  7f del

1.2.3 ASCII “bang bang”, escapes are ! and !!

This is the local code in use on Cybers at Université de Montréal, which grave and serious people there prefer to name ASCII code display. This code is also known as Bang-bang. It is based on a six-bit character set in which capitals, French diacritics and a few others are coded using an ! escape followed by a single character, and control characters using a double ! escape followed by a single character.

The routines given here presume that the six bits code is already expressed in ASCII by the communication channel, with embedded ASCII ! escapes.

Here is a table showing which characters are being used to encode each ASCII character.

000 !!@  020 !!P  040    060 0  100 @   120 !P  140 !@ 160 P
001 !!A  021 !!Q  041 !" 061 1  101 !A  121 !Q  141 A  161 Q
002 !!B  022 !!R  042 "  062 2  102 !B  122 !R  142 B  162 R
003 !!C  023 !!S  043 #  063 3  103 !C  123 !S  143 C  163 S
004 !!D  024 !!T  044 $  064 4  104 !D  124 !T  144 D  164 T
005 !!E  025 !!U  045 %  065 5  105 !E  125 !U  145 E  165 U
006 !!F  026 !!V  046 &  066 6  106 !F  126 !V  146 F  166 V
007 !!G  027 !!W  047 '  067 7  107 !G  127 !W  147 G  167 W
010 !!H  030 !!X  050 (  070 8  110 !H  130 !X  150 H  170 X
011 !!I  031 !!Y  051 )  071 9  111 !I  131 !Y  151 I  171 Y
012 !!J  032 !!Z  052 *  072 :  112 !J  132 !Z  152 J  172 Z
013 !!K  033 !![  053 +  073 ;  113 !K  133 [   153 K  173 ![
014 !!L  034 !!\  054 ,  074 <  114 !L  134 \   154 L  174 !\
015 !!M  035 !!]  055 -  075 =  115 !M  135 ]   155 M  175 !]
016 !!N  036 !!^  056 .  076 >  116 !N  136 ^   156 N  176 !^
017 !!O  037 !!_  057 /  077 ?  117 !O  137 _   157 O  177 !_
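The regularities of the table above can be expressed in a few lines. The following Python sketch is derived from the table only and does not claim to match the program's own routine; the name `bangbang_encode` is hypothetical.

```python
def bangbang_encode(text):
    """Encode 7-bit ASCII text following the bang-bang table above."""
    out = []
    for ch in text:
        code = ord(ch)
        if code > 0o177:
            raise ValueError("not a 7-bit ASCII character: %r" % ch)
        if code < 0o40:
            # Control characters: !!@ through !!_
            out.append("!!" + chr(code + 0o100))
        elif ch == "!":
            # The escape character itself is written !"
            out.append('!"')
        elif "A" <= ch <= "Z":
            # Capitals take a single ! escape.
            out.append("!" + ch)
        elif "a" <= ch <= "z":
            # Small letters are written as bare capitals.
            out.append(ch.upper())
        elif ch == "`" or code >= 0o173:
            # ` { | } ~ del become !@ ![ !\ !] !^ !_
            out.append("!" + chr(code - 0o40))
        else:
            out.append(ch)
    return "".join(out)
```

For example, the string `Hi!` is encoded as `!HI!"`, and a newline (octal 012) as `!!J`.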

1.2.3.1 Control Data’s Display Code

Octal display code to graphic       Octal display code to octal ASCII

00  :    20  P    40  5   60  #     00 072  20 120  40 065  60 043
01  A    21  Q    41  6   61  [     01 101  21 121  41 066  61 133
02  B    22  R    42  7   62  ]     02 102  22 122  42 067  62 135
03  C    23  S    43  8   63  %     03 103  23 123  43 070  63 045
04  D    24  T    44  9   64  "     04 104  24 124  44 071  64 042
05  E    25  U    45  +   65  _     05 105  25 125  45 053  65 137
06  F    26  V    46  -   66  !     06 106  26 126  46 055  66 041
07  G    27  W    47  *   67  &     07 107  27 127  47 052  67 046
10  H    30  X    50  /   70  '     10 110  30 130  50 057  70 047
11  I    31  Y    51  (   71  ?     11 111  31 131  51 050  71 077
12  J    32  Z    52  )   72  <     12 112  32 132  52 051  72 074
13  K    33  0    53  $   73  >     13 113  33 060  53 044  73 076
14  L    34  1    54  =   74  @     14 114  34 061  54 075  74 100
15  M    35  2    55      75  \     15 115  35 062  55 040  75 134
16  N    36  3    56  ,   76  ^     16 116  36 063  56 054  76 136
17  O    37  4    57  .   77  ;     17 117  37 064  57 056  77 073
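Since the display code has only 64 positions, the whole table above fits in one string indexed by display code. The Python sketch below is transcribed from the table; the names `DISPLAY_CODE`, `display_to_ascii` and `ascii_to_display` are hypothetical and are not taken from the program sources.

```python
import string

# Position n holds the graphic for display code n (octal 00 to 77).
DISPLAY_CODE = (
    ":"                               # 00
    + string.ascii_uppercase          # 01-32
    + string.digits                   # 33-44
    + "+-*/()$= ,.#[]%\"_!&'?<>@\\^;"  # 45-77
)


def display_to_ascii(codes):
    """Translate a sequence of display codes into an ASCII string."""
    return "".join(DISPLAY_CODE[code] for code in codes)


def ascii_to_display(text):
    """Translate an ASCII string into a list of display codes.

    Raises ValueError for characters which have no display code.
    """
    return [DISPLAY_CODE.index(ch) for ch in text]
```

The right half of the table is then simply `ord(DISPLAY_CODE[n])` printed in octal.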

1.2.4 ASCII 8-bits as seen by Perkin Elmer

This charset represents the way Concurrent Computer Corporation (formerly Perkin Elmer) expresses EBCDIC using ASCII.

1.2.5 ASCII 8-bits as seen by Control Data

This charset represents the way Control Data Corporation relates EBCDIC to ASCII. We also select the lower half of this table to do straight ASCII to EBCDIC conversions, back and forth.

1.2.6 ASCII 6/12 from NOS, escapes are ^ and @

This is one of the charsets in use on CDC Cyber NOS systems to represent ASCII, sometimes named NOS 6/12 code for coding ASCII. This code is also known as caret ASCII. It is based on a six-bit character set in which small letters and control characters are coded using a ^ escape and, sometimes, a @ escape.

The routines given here presume that the six-bit code is already expressed in ASCII by the communication channel, with embedded ASCII ^ and @ escapes.

Here is a table showing which characters are being used to encode each ASCII character.

000  ^5  020  ^#  040     060  0  100 @A  120  P  140  @G  160  ^P
001  ^6  021  ^[  041  !  061  1  101  A  121  Q  141  ^A  161  ^Q
002  ^7  022  ^]  042  "  062  2  102  B  122  R  142  ^B  162  ^R
003  ^8  023  ^%  043  #  063  3  103  C  123  S  143  ^C  163  ^S
004  ^9  024  ^"  044  $  064  4  104  D  124  T  144  ^D  164  ^T
005  ^+  025  ^_  045  %  065  5  105  E  125  U  145  ^E  165  ^U
006  ^-  026  ^!  046  &  066  6  106  F  126  V  146  ^F  166  ^V
007  ^*  027  ^&  047  '  067  7  107  G  127  W  147  ^G  167  ^W
010  ^/  030  ^'  050  (  070  8  110  H  130  X  150  ^H  170  ^X
011  ^(  031  ^?  051  )  071  9  111  I  131  Y  151  ^I  171  ^Y
012  ^)  032  ^<  052  *  072 @D  112  J  132  Z  152  ^J  172  ^Z
013  ^$  033  ^>  053  +  073  ;  113  K  133  [  153  ^K  173  ^0 
014  ^=  034  ^@  054  ,  074  <  114  L  134  \  154  ^L  174  ^1
015  ^   035  ^\  055  -  075  =  115  M  135  ]  155  ^M  175  ^2 
016  ^,  036  ^^  056  .  076  >  116  N  136 @B  156  ^N  176  ^3
017  ^.  037  ^;  057  /  077  ?  117  O  137  _  157  ^O  177  ^4
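The table above follows a few simple patterns, which the Python sketch below tries to capture. It is derived from the table only and is not the program's own routine; the names `CONTROL_MARKS`, `SPECIAL` and `nos612_encode` are hypothetical.

```python
# Graphics written after ^ for the control characters 000 to 037.
# They are the display code graphics for octal 40 to 77.
CONTROL_MARKS = "56789+-*/()$= ,.#[]%\"_!&'?<>@\\^;"

# The four characters which need an @ escape.
SPECIAL = {":": "@D", "@": "@A", "^": "@B", "`": "@G"}


def nos612_encode(text):
    """Encode 7-bit ASCII text following the NOS 6/12 table above."""
    out = []
    for ch in text:
        code = ord(ch)
        if code > 0o177:
            raise ValueError("not a 7-bit ASCII character: %r" % ch)
        if code < 0o40:
            out.append("^" + CONTROL_MARKS[code])
        elif ch in SPECIAL:
            out.append(SPECIAL[ch])
        elif "a" <= ch <= "z":
            # Small letters: ^A through ^Z
            out.append("^" + ch.upper())
        elif code >= 0o173:
            # { | } ~ del become ^0 ^1 ^2 ^3 ^4
            out.append("^" + chr(code - 0o173 + ord("0")))
        else:
            out.append(ch)
    return "".join(out)
```

For example, `Hi` is encoded as `H^I` and `a:b` as `^A@D^B`.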

1.2.7 EBCDIC with no further comments

This charset is IBM’s Extended Binary Coded Decimal Interchange Code. This is an eight-bit code.

1.2.8 ASCII without diacritics nor underline

This code is ASCII expunged of all diacritics and underlines, as long as they are applied using three-character sequences, with <BS> in the middle. Also, although this is only loosely related, each control character is represented by a sequence of two or three graphic characters. The newline character, however, keeps its functionality and is not represented.

Note that charset flat is a terminal charset. We can convert to flat, but not from it.

1.2.9 ASCII 8-bits for IBM’s PC

The file was obtained or is aimed towards a PC microcomputer from IBM or any compatible. This is an eight-bit code.

1.2.10 ASCII for the Unisys’ ICON

The file is using Unisys’ ICON way to represent diacritics with 0x19 escape sequences. This is a seven-bit code, even if eight-bit codes can flow through as part of IBM-PC charset.

1.2.11 ASCII with LaTeX codes

This charset is an ASCII file coded to be read by LaTeX or, in certain cases, by TeX.

1.2.12 ASCII extended by Latin Alphabet 1

This charset corresponds to the ISO Latin Alphabet 1. It is an eight-bit code which coincides with ASCII for the lower half.

1.2.12.1 Commented Latin-1

oct dec hex     description

240 160 a0      no-break space
241 161 a1      inverted exclamation mark
242 162 a2      cent sign
243 163 a3      pound sign
244 164 a4      currency sign
245 165 a5      yen sign
246 166 a6      broken bar
247 167 a7      paragraph sign, section sign
250 168 a8      diaeresis
251 169 a9      copyright sign
252 170 aa      feminine ordinal indicator
253 171 ab      left angle quotation mark
254 172 ac      not sign
255 173 ad      soft hyphen
256 174 ae      registered trade mark sign
257 175 af      macron
260 176 b0      degree sign
261 177 b1      plus-minus sign
262 178 b2      superscript two
263 179 b3      superscript three
264 180 b4      acute accent
265 181 b5      small greek mu, micro sign
266 182 b6      pilcrow sign
267 183 b7      middle dot
270 184 b8      cedilla
271 185 b9      superscript one
272 186 ba      masculine ordinal indicator
273 187 bb      right angle quotation mark
274 188 bc      vulgar fraction one quarter
275 189 bd      vulgar fraction one half
276 190 be      vulgar fraction three quarters
277 191 bf      inverted question mark
300 192 c0      capital A with grave accent
301 193 c1      capital A with acute accent
302 194 c2      capital A with circumflex accent
303 195 c3      capital A with tilde
304 196 c4      capital A with diaeresis
305 197 c5      capital A with ring above
306 198 c6      capital diphthong A with E
307 199 c7      capital C with cedilla
310 200 c8      capital E with grave accent
311 201 c9      capital E with acute accent
312 202 ca      capital E with circumflex accent
313 203 cb      capital E with diaeresis
314 204 cc      capital I with grave accent
315 205 cd      capital I with acute accent
316 206 ce      capital I with circumflex accent
317 207 cf      capital I with diaeresis
320 208 d0      capital icelandic ETH
321 209 d1      capital N with tilde
322 210 d2      capital O with grave accent
323 211 d3      capital O with acute accent
324 212 d4      capital O with circumflex accent
325 213 d5      capital O with tilde
326 214 d6      capital O with diaeresis
327 215 d7      multiplication sign
330 216 d8      capital O with oblique stroke
331 217 d9      capital U with grave accent
332 218 da      capital U with acute accent
333 219 db      capital U with circumflex accent
334 220 dc      capital U with diaeresis
335 221 dd      capital Y with acute accent
336 222 de      capital icelandic THORN
337 223 df      small german sharp s
340 224 e0      small a with grave accent
341 225 e1      small a with acute accent
342 226 e2      small a with circumflex accent
343 227 e3      small a with tilde
344 228 e4      small a with diaeresis
345 229 e5      small a with ring above
346 230 e6      small diphthong a with e
347 231 e7      small c with cedilla
350 232 e8      small e with grave accent
351 233 e9      small e with acute accent
352 234 ea      small e with circumflex accent
353 235 eb      small e with diaeresis
354 236 ec      small i with grave accent
355 237 ed      small i with acute accent
356 238 ee      small i with circumflex accent
357 239 ef      small i with diaeresis
360 240 f0      small icelandic eth
361 241 f1      small n with tilde
362 242 f2      small o with grave accent
363 243 f3      small o with acute accent
364 244 f4      small o with circumflex accent
365 245 f5      small o with tilde
366 246 f6      small o with diaeresis
367 247 f7      division sign
370 248 f8      small o with oblique stroke
371 249 f9      small u with grave accent
372 250 fa      small u with acute accent
373 251 fb      small u with circumflex accent
374 252 fc      small u with diaeresis
375 253 fd      small y with acute accent
376 254 fe      small icelandic thorn
377 255 ff      small y with diaeresis

1.2.12.2 Octal Latin-1

200    220    240 nsp 260 ++  300 A`  320 DD  340 a`  360 dd 
201    221    241 !!  261 +-  301 A'  321 N~  341 a'  361 n~ 
202    222    242 c|  262 22  302 A^  322 O`  342 a^  362 o` 
203    223    243 ##  263 33  303 A~  323 O'  343 a~  363 o' 
204    224    244 cur 264 ''  304 A"  324 O^  344 a"  364 o^ 
205    225    245 y-  265 uu  305 A+  325 O~  345 a+  365 o~ 
206    226    246 ||  266 pil 306 AE  326 O"  346 ae  366 o" 
207    227    247 $$  267 ..  307 C,  327 xx  347 c,  367 // 
210    230    250 ""  270 ,,  310 E`  330 O/  350 e`  370 o/ 
211    231    251 cO  271 11  311 E'  331 U`  351 e'  371 u` 
212    232    252 a-  272 o-  312 E^  332 U'  352 e^  372 u' 
213    233    253 <<  273 >>  313 E"  333 U^  353 e"  373 u^ 
214    234    254 -.  274 14  314 I`  334 U"  354 i`  374 u" 
215    235    255 --  275 12  315 I'  335 Y'  355 i'  375 y' 
216    236    256 tO  276 34  316 I^  336 PP  356 i^  376 pp 
217    237    257 mac 277 ??  317 I"  337 ss  357 i"  377 y"

1.2.12.3 Decimal Latin-1

128    144    160 nsp 176 ++  192 A`  208 DD  224 a`  240 dd 
129    145    161 !!  177 +-  193 A'  209 N~  225 a'  241 n~ 
130    146    162 c|  178 22  194 A^  210 O`  226 a^  242 o` 
131    147    163 ##  179 33  195 A~  211 O'  227 a~  243 o' 
132    148    164 cur 180 ''  196 A"  212 O^  228 a"  244 o^ 
133    149    165 y-  181 uu  197 A+  213 O~  229 a+  245 o~ 
134    150    166 ||  182 pil 198 AE  214 O"  230 ae  246 o" 
135    151    167 $$  183 ..  199 C,  215 xx  231 c,  247 // 
136    152    168 ""  184 ,,  200 E`  216 O/  232 e`  248 o/ 
137    153    169 cO  185 11  201 E'  217 U`  233 e'  249 u` 
138    154    170 a-  186 o-  202 E^  218 U'  234 e^  250 u' 
139    155    171 <<  187 >>  203 E"  219 U^  235 e"  251 u^ 
140    156    172 -.  188 14  204 I`  220 U"  236 i`  252 u" 
141    157    173 --  189 12  205 I'  221 Y'  237 i'  253 y' 
142    158    174 tO  190 34  206 I^  222 PP  238 i^  254 pp 
143    159    175 mac 191 ??  207 I"  223 ss  239 i"  255 y"

1.2.12.4 Hexadecimal Latin-1

 80    90    a0 nsp  b0 ++  c0 A`  d0 DD  e0 a`  f0 dd 
 81    91    a1 !!   b1 +-  c1 A'  d1 N~  e1 a'  f1 n~ 
 82    92    a2 c|   b2 22  c2 A^  d2 O`  e2 a^  f2 o` 
 83    93    a3 ##   b3 33  c3 A~  d3 O'  e3 a~  f3 o' 
 84    94    a4 cur  b4 ''  c4 A"  d4 O^  e4 a"  f4 o^ 
 85    95    a5 y-   b5 uu  c5 A+  d5 O~  e5 a+  f5 o~ 
 86    96    a6 ||   b6 pil c6 AE  d6 O"  e6 ae  f6 o" 
 87    97    a7 $$   b7 ..  c7 C,  d7 xx  e7 c,  f7 // 
 88    98    a8 ""   b8 ,,  c8 E`  d8 O/  e8 e`  f8 o/ 
 89    99    a9 cO   b9 11  c9 E'  d9 U`  e9 e'  f9 u` 
 8a    9a    aa a-   ba o-  ca E^  da U'  ea e^  fa u' 
 8b    9b    ab <<   bb >>  cb E"  db U^  eb e"  fb u^ 
 8c    9c    ac -.   bc 14  cc I`  dc U"  ec i`  fc u" 
 8d    9d    ad --   bd 12  cd I'  dd Y'  ed i'  fd y' 
 8e    9e    ae tO   be 34  ce I^  de PP  ee i^  fe pp 
 8f    9f    af mac  bf ??  cf I"  df ss  ef i"  ff y"

1.2.13 ASCII with easy French conventions

This charset is identical to ascii, save for French diacritics which are noted using a slightly different convention.

See section Easy French conventions for more details.

1.3 Easy French conventions

These conventions are used in the texte and latexte charsets, which are seven-bit codes. At text entry time, these conventions provide a little speed-up. At read time, they slightly improve readability. Of course, it would be better to have a specialized keyboard to make direct eight-bit entries, and fonts for immediately displaying eight-bit ISO Latin-1 characters. But not everybody is so fortunate. In several mailing environments, the eighth bit is often willfully destroyed (a horrible crime that most people do not care to straighten up).

1.3.1 French quotes

French quotes (sometimes called “angle quotes”) are noted the same way English quotes are noted in TeX, that is, by `` and ''.

1.3.2 Latin ligatures

No effort has been made to preserve Latin ligatures (ae, oe), which are representable in several other charsets. So, these ligatures may be lost through Easy French conventions.

1.3.3 Diacritics

This is almost the French convention for simplified diacritics entry:

e': Acute accent
e`: Grave accent
e^: Circumflex accent
e": Diaeresis
c,: Cedilla

In some countries, : is used instead of " to mark diaeresis. ‘recode’ supports only one of these conventions in a single call, selected by the -c option of the recode command.

The convention is prone to losing information, because the diacritic meaning overloads some characters that already have other uses. To alleviate this, some knowledge of the French language is built into the recognition routines. So, the following subtleties are systematically obeyed by the various recognizers.

A single quote which follows an e does not necessarily mean an acute accent if it is followed by another single quote. For example:

e'

will give an e with an acute accent.

e''

will give a simple e, with a closing quotation mark.

e'''

will give an e with an acute accent, followed by a closing quotation mark.

There is a problem induced by this convention if there are English citations within a French text. In sentences like:

There's a meeting at Archie's restaurant.

the single quotes will be mistaken twice for acute accents. So English contractions and suffix possessives could be mangled.

A double quote or colon, depending on the -c option, which follows a vowel is interpreted as a diaeresis only if it is followed by another letter. However, several French words end with a diaeresis; the program also recognizes them. See section List of words ending with diaeresis for a study of all the problematic cases.

A comma which follows a c is interpreted as a cedilla only if it is followed by one of the vowels a, o and u.
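The three subtleties above can be sketched with regular expressions. This Python fragment is an illustration, not the recognizer used by the program (which is written in ‘lex’): it handles small letters only, ignores the list of words ending with a diaeresis, and assumes that any odd run of single quotes after an e starts with an accent, which generalizes the three examples given above. The name `decode_easy_french` is hypothetical.

```python
import re

# Small vowels with diaeresis, in ISO Latin-1.
DIAERESIS = {"a": "\u00e4", "e": "\u00eb", "i": "\u00ef", "o": "\u00f6", "u": "\u00fc"}


def decode_easy_french(text, diaeresis_mark='"'):
    """Apply the acute, diaeresis and cedilla rules; pass ':' to mimic -c."""

    def acute(match):
        quotes = match.group(1)
        if len(quotes) % 2 == 1:
            # Odd run: the first quote is an accent, the rest are left alone.
            return "\u00e9" + quotes[1:]
        # Even run: only quotation marks, the e stays plain.
        return "e" + quotes

    text = re.sub(r"e('+)", acute, text)
    # A diaeresis mark counts only between a vowel and another letter.
    text = re.sub(
        "([aeiou])" + re.escape(diaeresis_mark) + r"(?=[^\W\d_])",
        lambda match: DIAERESIS[match.group(1)],
        text,
    )
    # A comma after c is a cedilla only before a, o or u.
    return re.sub(r"c,(?=[aou])", "\u00e7", text)
```

Feeding it the English word `There's` shows the problem described above: the result is `Therés`.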

1.3.4 List of words ending with diaeresis

Here is a classification of all cases of a diaeresis at the end of a French word:

Words ending in “igue”
- - Feminine words without a relative masculine:
```
besaigue" cigue"
```
- - Feminine words with a relative masculine: (1)
```
aigue" ambigue" contigue" exigue" subaigue" suraigue"
```
Words not ending in “igue”
- - Ended by “i”: (2)
```
ai" congai" goi" hai"kai" inoui" sai" samurai" thai" tokai"
```
- - Ended by “e”:
```
canoe"
```
- - Ended by “u”: (3)
```
Esau"
```

Notes:

(1) There are supposed to be seven words in this case. So, one is missing.

(2) Look at the following sentence:

"Ai"e! Voici le proble`me que j'ai"

or, using the -c option:

Ai:e! Voici le proble`me que j'ai:

There is an ambiguity between an ai", the small animal, and the present indicative of avoir (first person singular), when followed by what could be a diaeresis mark. Fortunately, the case is solved by the fact that an apostrophe always precedes the verb and almost never the animal.

(3) I did not pay attention to proper nouns, but this one showed up as being fairly evident.

Just to complete this topic, note that it would be wrong to make a rule that all words ending in “igue” need a diaeresis. Here are counter-examples:

becfigue be`sigue bigue bordigue bourdigue brigue contre-digue
digue d'intrigue fatigue figue garrigue gigue igue intrigue
ligue prodigue sarigue zigue

1.3.5 When, How and Who.

Easy French has been in use in France for a while. Loic Dachary <loic@design.axis.fr> first exposed me to this particular convention. I only slightly adapted it (the diaeresis option) to accommodate several usages in Québec originating from Université de Montréal.

In fact, the main problem for me was not necessarily to invent Easy French, but to recognize the “best” convention to use (“best” is not defined here) and to try to solve the main pitfalls associated with the selected convention. I’m particularly grateful to Claude Goutier <6@cc.umontreal.ca> who, through numerous discussions in August 1988, was quite helpful in evaluating various hypotheses.

1.4 Internal aspects

This information is organized in the following sections.

1.4.1 Overall organization

The main driver has a table giving the conversion routines available and, for each, the starting charset and the ending charset. It then tries to figure out the shortest sequence of conversions that will transform the input charset into the final charset. Let us consider these charsets as being the nodes of a directed graph. ‘recode’ has internally a few elementary recoding methods, called single-steps, each of which may be considered as an oriented arc from one node to another. A cost is attributed to each single-step. Given a starting code and a goal code, ‘recode’ computes the most economical route through the elementary recodings.
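The search for the most economical route can be sketched as a shortest-path computation over the directed graph. This manual does not say which algorithm ‘recode’ uses; the Python below uses Dijkstra's algorithm as one standard choice. The step names and costs in `STEPS` are invented for the example and do not come from the program's tables.

```python
import heapq


def cheapest_route(single_steps, start, goal):
    """Return (total cost, list of step names), or None if no route exists.

    single_steps is an iterable of (name, before, after, cost) tuples,
    each one being an oriented arc from charset `before` to charset `after`.
    """
    arcs = {}
    for name, before, after, cost in single_steps:
        arcs.setdefault(before, []).append((after, cost, name))
    queue = [(0, start, [])]
    settled = set()
    while queue:
        total, charset, route = heapq.heappop(queue)
        if charset == goal:
            return total, route
        if charset in settled:
            continue
        settled.add(charset)
        for after, cost, name in arcs.get(charset, []):
            if after not in settled:
                heapq.heappush(queue, (total + cost, after, route + [name]))
    return None


# Hypothetical single-steps with made-up costs.
STEPS = [
    ("texte_to_latin1", "texte", "latin1", 2),
    ("latin1_to_ibmpc", "latin1", "ibmpc", 1),
    ("latin1_to_ascii", "latin1", "ascii", 3),
    ("ascii_to_ibmpc", "ascii", "ibmpc", 1),
    ("latin1_to_flat", "latin1", "flat", 4),
]
```

With this table, `cheapest_route(STEPS, "texte", "ibmpc")` goes through latin1 for a total cost of 3, while no route leaves flat, in line with its description as a terminal charset.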

The main part of ‘recode’ is written in C, as are most single-steps. A few single-steps which need to recognize sequences of multiple characters are written in ‘lex’.

1.4.2 Internal vs external piping

Suppose that four elementary steps are selected at path optimization time. Then ‘recode’ will split itself into four different tasks interconnected with pipes, logically equivalent to:

step1 <input | step2 | step3 | step4 >output
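As a loose analogy only, the same chaining can be imitated inside one process with lazy generators, each step consuming the output of the previous one as it becomes available. The real program uses separate tasks connected by pipes; the Python below does not, and the two elementary steps shown are invented for the example.

```python
def pipeline(steps, data):
    """Chain elementary steps lazily, in the spirit of step1 | step2 | ..."""
    stream = iter(data)
    for step in steps:
        stream = step(stream)
    return stream


# Two hypothetical elementary steps working on a stream of characters.
def drop_carriage_returns(chars):
    for ch in chars:
        if ch != "\r":
            yield ch


def upcase(chars):
    for ch in chars:
        yield ch.upper()
```

Nothing is computed until the final stream is read, for instance with `"".join(pipeline([drop_carriage_returns, upcase], text))`.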

1.4.3 Some limitations

Here are some limitations of the program.

There is a limit (currently 10) on the number of steps allowed in a single recoding job. It should remain sufficient for quite a while, maybe forever. In any case, this is a simple compile-time #define.

1.4.4 Adding new charsets

It is fairly easy for a programmer to add a new charset to ‘recode’. All it requires is writing two routines, modifying a few tables, and running ‘make’ to rebuild ‘recode’.

One of the routines should convert from any previous charset to the new one. Any previous charset will do, but try to select it so that you will not lose too much information while converting. If you have to read multiple bytes of the old charset before recognizing the character to produce, you might write this routine in ‘lex’; otherwise, use C. Model your routine on one of the existing ones, so as to keep the sources uniform.

The other routine should convert from the new charset to any older one. You do not have to select the same old charset as the one you selected for the previous routine. Select any charset for which you will not lose too much information while converting. If the routine has to read multiple bytes of the new charset before deciding which character it will produce, you might write this routine in ‘lex’; otherwise, use C. Model your routine on one of the existing ones, so as to keep the sources uniform.
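To give an idea of what a simple C routine of this kind does, here is a minimal sketch of a lossy, byte-by-byte conversion from ISO 8859-1 to plain ASCII. It is not taken from the ‘recode’ sources: the function name and the buffer-based interface are assumptions (the real routines should be modelled on the existing ones), and only a handful of French letters are mapped. Characters with no approximation are simply dropped.

```c
#include <stddef.h>

/*
 * Hypothetical routine: convert `length` bytes of ISO 8859-1 text from
 * `in` into ASCII in `out`, which must have room for `length` bytes.
 * Accented letters fall back on the bare letter; other bytes above 127
 * are discarded.  Returns the number of bytes produced.
 */
size_t latin1_to_ascii(const unsigned char *in, size_t length,
                       unsigned char *out)
{
    static unsigned char table[256];
    static int ready = 0;
    size_t i, produced = 0;

    if (!ready) {
        /* ASCII maps to itself; everything else starts as "no equivalent". */
        for (i = 0; i < 256; i++)
            table[i] = i < 128 ? (unsigned char) i : 0;
        table[0xE0] = 'a';          /* a grave */
        table[0xE2] = 'a';          /* a circumflex */
        table[0xE7] = 'c';          /* c cedilla */
        table[0xE8] = 'e';          /* e grave */
        table[0xE9] = 'e';          /* e acute */
        table[0xEA] = 'e';          /* e circumflex */
        table[0xEB] = 'e';          /* e diaeresis */
        table[0xEE] = 'i';          /* i circumflex */
        table[0xEF] = 'i';          /* i diaeresis */
        table[0xF4] = 'o';          /* o circumflex */
        table[0xF9] = 'u';          /* u grave */
        table[0xFB] = 'u';          /* u circumflex */
        table[0xFC] = 'u';          /* u diaeresis */
        ready = 1;
    }

    for (i = 0; i < length; i++) {
        unsigned char c = table[in[i]];
        /* A zero entry means "offending character", except for NUL itself. */
        if (c != 0 || in[i] == 0)
            out[produced++] = c;
    }
    return produced;
}
```

Because every input byte decides its own output, no lookahead is needed and C is enough; a conversion that had to recognize multi-byte sequences would be the case where ‘lex’ is suggested above.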

Edit ‘Makefile’ to add the object names of your two routines to the C_STEPS or L_STEPS macro definition, depending on whether each routine is written in C or in ‘lex’. Then edit ‘steps.h’ in the following four places:

Create a symbol for your new charset in the enum TYPE_code definition.
Add the option name of your new charset to the code_keywords initialization.
Add two extern declarations for your routines at the appropriate places.
Add two lines to the single_steps array initialization to declare your routines. For each line, include the following four fields:
1. The function name of your routine.
2. The starting code enum constant, that is, the code your routine reads.
3. The goal code enum constant, that is, the code your routine produces.
4. The cost of your routine, using the predefined constants STEP, LOOSE, EXACT, SLOW and FAST. See the comments for the exact meaning of each of these and follow the examples. Respect these meanings and be honest with the costs!
In some circumstances, one of your routines would be a mere copy. In this case it is better not to provide the routine, but still to declare it in single_steps, using NULL as its function name and ALREADY alone as its cost.
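The four edits to ‘steps.h’ might look roughly like the sketch below. Only the identifiers TYPE_code, code_keywords, single_steps, STEP, LOOSE, EXACT, SLOW, FAST and ALREADY come from this manual. Everything else is an assumption made for illustration: the numeric values of the cost constants, the way they are combined with `+`, the struct layout, the routine signature, and the names CODE_OLDSET, CODE_NEWSET and oldset_to_newset. Check the comments and existing entries in the real file before copying anything.

```c
#include <stddef.h>

/* Cost constants: names from the manual, VALUES INVENTED for this sketch. */
enum { ALREADY = 0, STEP = 1, EXACT = 0, LOOSE = 16, FAST = 0, SLOW = 4 };

/* 1. A symbol for the new charset (both symbol names are hypothetical). */
enum TYPE_code { CODE_OLDSET, CODE_NEWSET };

/* 2. Its option name, assumed here to follow the order of the enum. */
const char *const code_keywords[] = { "oldset", "newset" };

/*
 * 3. Declarations for the routines.  In the real sources these would be
 *    extern declarations of routines living in their own files; an empty
 *    stub is defined here only so that the sketch links on its own.
 *    The signature is an assumption.
 */
void oldset_to_newset(void) { }

/* 4. Two lines in single_steps: routine, starting code, goal code, cost. */
struct single_step {
    void (*routine)(void);
    enum TYPE_code start;
    enum TYPE_code goal;
    int cost;
};

const struct single_step single_steps[] = {
    /* A real routine that loses some information but runs quickly. */
    { oldset_to_newset, CODE_OLDSET, CODE_NEWSET, STEP + LOOSE + FAST },
    /* The reverse direction is a mere copy: no routine, ALREADY alone. */
    { NULL,             CODE_NEWSET, CODE_OLDSET, ALREADY },
};
```

The second entry shows the mere-copy case: the arc still exists in the graph, so the path search can use it, but no routine is attached to it.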

1.5 Future things

I will be glad to hear criticism and suggestions, even about details. In fact, this program is made up of hundreds of details. Write to pinard@iro.umontreal.ca.

Some notes and suggestions.

Accept abbreviations for charsets on the command line. Accept more than one conversion, with intermediate filters, in a single call.
Support the Universite de Montreal “accent” convention.
Support [nt]roff diacritics.
Support the Atari-ST internal code.
Segregate charsets and usages.
Is there some way of specifying that recode should not contract something that looks like an accent? Like "There\’s a meeting at Archie\’s restaurant"? (With corresponding insertion of backslashes or whatever when converting the other way, of course: the transformation from accented to ASCII should be exactly invertible in all cases.) Of course, There\’s will not be contracted.

About This Document

This document was generated on December 8, 2024 using texi2html 5.0.
